Cosmos3 ModularPipeline by yzhautouskay · Pull Request #14110 · huggingface/diffusers

yzhautouskay · 2026-07-02T13:56:41Z

What does this PR do?

Summary

Add Cosmos3 modular pipeline support via Cosmos3OmniModularPipeline and Cosmos3OmniBlocks.
Implement modular Cosmos3 stages for encoding, pre-denoise setup, denoising loop, and decoding.
Register/export Cosmos3 modular pipeline components in modular and top-level package mappings.
Add Cosmos3 modular documentation and usage section in docs/source/en/api/pipelines/cosmos3.md.
Add strict elementwise parity tests across text/image/video, optional sound, and action-conditioned modes.

Test Plan

PYTHONPATH=src python -m pytest -q tests/pipelines/cosmos/test_cosmos3_modular_parity.py -vv

Before submitting

Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
- Did you read the Coding with AI agents guide?
- Did you self-review the diff against .ai/review-rules.md?
Did you read the contributor guideline?
Did you read our philosophy doc? (important for complex PRs)
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?
Are you the author (or part of the team) of the model/pipeline (only applicable for model/pipeline related PRs)?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

yiyixuxu

Thanks for working on this!
I did an initial review, focusing mainly on the encoder/decoder blocks for now. In modular, these blocks are meant to be run either standalone (e.g. a user encodes an image once, keeps the latents, and reuses them across generations) or combined into a pipeline you can run end-to-end like a standard pipeline.

I will do another pass soon. Let me know if you have any questions!

yiyixuxu · 2026-07-03T01:41:23Z

+    @property
+    def expected_components(self) -> list[ComponentSpec]:
+        return [
+            ComponentSpec("transformer", Cosmos3OmniTransformer),


Suggested change

ComponentSpec("transformer", Cosmos3OmniTransformer),

yiyixuxu · 2026-07-03T01:44:57Z

+            ComponentSpec("vae", AutoencoderKLWan),
+            ComponentSpec("sound_tokenizer", Cosmos3AVAEAudioTokenizer),


Suggested change

ComponentSpec("vae", AutoencoderKLWan),
ComponentSpec("sound_tokenizer", Cosmos3AVAEAudioTokenizer),

yiyixuxu · 2026-07-03T01:46:29Z

+
+    @property
+    def description(self) -> str:
+        return "Validates inputs, tokenizes prompts, and packs text conditioning."


I think this step can just run the safety_checker and tokenize. We want the text encoder block to be meaningful to run standalone, as well as combined with other blocks.

I.e., the user can run it once, keep the text segments, and reuse them across many generations with different resolutions, conditional inputs, seeds, etc.
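
A minimal sketch of the reuse pattern described above, written in plain Python rather than against the diffusers API. `TextSegments`, `TextEncoderStep`, and `generate` are hypothetical stand-ins introduced only for illustration; the point is that a text step with no dependency on resolution or seed can run once and feed many downstream runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSegments:
    # Hypothetical container for the tokenized prompt; not a diffusers type.
    token_ids: tuple


class TextEncoderStep:
    """Stand-in text block: it only tokenizes, so it never needs height, width, or seed."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt: str) -> TextSegments:
        self.calls += 1
        # Toy "tokenizer": one integer (the word length) per whitespace-separated word.
        return TextSegments(tuple(len(word) for word in prompt.split()))


def generate(text_segments: TextSegments, seed: int, height: int, width: int) -> dict:
    """Stand-in for the rest of the pipeline; it consumes precomputed text segments."""
    return {"tokens": text_segments.token_ids, "seed": seed, "size": (height, width)}


text_step = TextEncoderStep()
segments = text_step("a robot arm stacking cubes")  # run the text block once

# Reuse the same segments across seeds and resolutions without re-tokenizing.
sizes = [(720, 1280), (480, 832)]
results = [generate(segments, seed=s, height=h, width=w) for s, (h, w) in enumerate(sizes)]
```

If the text step also handled resolution or conditioning inputs, the cached `segments` would have to be recomputed whenever those change, which is what keeping the step narrow avoids.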

yiyixuxu · 2026-07-03T01:50:00Z

+            InputParam(name="image", default=None),
+            InputParam(name="video", default=None),
+            InputParam(name="condition_frame_indexes_vision", default=(0, 1)),
+            InputParam(name="condition_video_keep", default="first"),


Suggested change

InputParam(name="image", default=None),
InputParam(name="video", default=None),
InputParam(name="condition_frame_indexes_vision", default=(0, 1)),
InputParam(name="condition_video_keep", default="first"),

yiyixuxu · 2026-07-03T01:52:32Z

+            InputParam(name="guidance_scale", type_hint=float, default=6.0),
+            InputParam(name="enable_sound", type_hint=bool, default=False),
+            InputParam(name="action", type_hint=CosmosActionCondition, default=None),


Suggested change

InputParam(name="guidance_scale", type_hint=float, default=6.0),
InputParam(name="enable_sound", type_hint=bool, default=False),
InputParam(name="action", type_hint=CosmosActionCondition, default=None),

yiyixuxu · 2026-07-03T02:22:40Z

+        if isinstance(block_state.callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
+            block_state.callback_on_step_end_tensor_inputs = block_state.callback_on_step_end.tensor_inputs


Suggested change

if isinstance(block_state.callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
    block_state.callback_on_step_end_tensor_inputs = block_state.callback_on_step_end.tensor_inputs

We do not need to support pipeline callbacks in modular, since it is so easy to insert or swap blocks.
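
To illustrate the "insert a block instead of a callback" idea, here is a rough sketch in plain Python. It does not use the diffusers modular API: the `OrderedDict` of functions, `insert_after`, and `clamp_latents` are all hypothetical names, and the custom block simply stands in for logic a callback might otherwise carry.

```python
from collections import OrderedDict


def denoise(state: dict) -> dict:
    # Stand-in denoise block: halves the latent values.
    state["latents"] = [x * 0.5 for x in state["latents"]]
    return state


def decode(state: dict) -> dict:
    # Stand-in decode block: rounds latents into integer "pixels".
    state["frames"] = [round(x) for x in state["latents"]]
    return state


def clamp_latents(state: dict) -> dict:
    # Custom block carrying logic that might otherwise live in a callback.
    state["latents"] = [max(-1.0, min(1.0, x)) for x in state["latents"]]
    return state


def insert_after(blocks: OrderedDict, anchor: str, name: str, block) -> OrderedDict:
    """Return a new ordered mapping with `block` placed right after `anchor`."""
    if anchor not in blocks:
        raise KeyError(anchor)
    out = OrderedDict()
    for key, value in blocks.items():
        out[key] = value
        if key == anchor:
            out[name] = block
    return out


def run(blocks: OrderedDict, state: dict) -> dict:
    # Execute the blocks in order, threading the shared state through each one.
    for block in blocks.values():
        state = block(state)
    return state


blocks = OrderedDict([("denoise", denoise), ("decode", decode)])
custom = insert_after(blocks, "denoise", "clamp", clamp_latents)
final_state = run(custom, {"latents": [4.0, -6.0, 0.5]})
```

The original `blocks` mapping is left untouched, so the default and customized variants can coexist.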

yiyixuxu · 2026-07-03T02:24:12Z

+            if block_state.width is None:
+                block_state.width = 1280
+
+        components.check_inputs(


This block only needs to check the inputs it uses (I think you cannot directly reuse the check_inputs method from the standard pipeline).
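
One way to picture block-scoped validation is the sketch below. It is plain Python under stated assumptions: `TextStep`, its `input_names` tuple, and the dict-based state are hypothetical and do not reflect how the diffusers blocks declare inputs.

```python
class TextStep:
    """Hypothetical block that declares and validates only its own inputs."""

    input_names = ("prompt", "negative_prompt")

    @staticmethod
    def check_inputs(prompt, negative_prompt=None):
        if not isinstance(prompt, (str, list)):
            raise ValueError(f"`prompt` must be a str or list, got {type(prompt).__name__}")
        if negative_prompt is not None and not isinstance(negative_prompt, (str, list)):
            raise ValueError("`negative_prompt` must be a str or list when provided")

    def __call__(self, state: dict) -> dict:
        # Pull only the inputs this block declares; height, width, etc. are ignored here.
        own_inputs = {name: state.get(name) for name in self.input_names}
        self.check_inputs(**own_inputs)
        state["prompt_checked"] = True
        return state


# `width` is invalid, but that is another block's concern, so the text step still runs.
state = TextStep()({"prompt": "a cat", "width": -1})
```

A pipeline-wide `check_inputs` would instead reject this call because of `width`, even though the text step never reads it.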

yiyixuxu · 2026-07-03T02:27:28Z

+            condition_frame_indexes_vision=block_state.condition_frame_indexes_vision,
+        )
+
+        block_state.action_mode = block_state.action.mode if block_state.action is not None else None


Can we give action its own text block? A Cosmos3ActionTextStep that takes prompt + action and then builds the action JSON prompt, does the resolution binning, tokenizes, ...

Then you can wrap this step (Cosmos3TextEncoderStep) and Cosmos3ActionTextStep into an AutoPipelineBlocks (e.g. Cosmos3AutoTextEncoderStep) triggered on action. This way each mode's text logic stays self-contained and more readable.

See an example here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/qwenimage/modular_blocks_qwenimage_edit.py#L200

This is an auto VAE encoder step, but it should work similarly for the text step as well.
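
The dispatch idea can be sketched in plain Python as follows. This is not `AutoPipelineBlocks` itself: the class names, the `(trigger, block)` list, and the use of `None` as the default entry are assumptions of this sketch, and the toy "tokenizers" stand in for the real text logic.

```python
import json


class TextStep:
    """Default text block: toy whitespace tokenization of the prompt."""

    def __call__(self, state: dict) -> dict:
        state["text_segments"] = state["prompt"].split()
        return state


class ActionTextStep:
    """Action text block: builds a JSON prompt from prompt + action."""

    def __call__(self, state: dict) -> dict:
        payload = json.dumps({"prompt": state["prompt"], "action": state["action"]}, sort_keys=True)
        state["text_segments"] = [payload]
        return state


class AutoTextStep:
    """Runs the first sub-block whose trigger input is present; `None` marks the default."""

    blocks = [("action", ActionTextStep()), (None, TextStep())]

    def select(self, state: dict):
        for trigger, block in self.blocks:
            if trigger is None or state.get(trigger) is not None:
                return block
        raise ValueError("no sub-block matched the provided inputs")

    def __call__(self, state: dict) -> dict:
        return self.select(state)(state)


auto_step = AutoTextStep()
plain = auto_step({"prompt": "a red cube"})
with_action = auto_step({"prompt": "a red cube", "action": {"mode": "forward"}})
```

Each sub-block stays readable on its own, and the wrapper is the only place that knows about the `action` trigger.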

yiyixuxu · 2026-07-03T02:48:11Z

+
+
+logger = logging.get_logger(__name__)
+


Can you separate the VAE encoding from prepare_latent and add a proper Cosmos3VaeEncoderStep here?

We probably need a Cosmos3VaeEncoderStep (for i2v and v2v) and a Cosmos3ActionVaeEncoderStep, and pack them into an auto-step triggered on image/video/action.

Similar to the text step, the VAE encoder step should also be able to run standalone when needed: a user should be able to run just the VAE encoder once, keep the latents, and reuse them across generations.

yiyixuxu · 2026-07-03T02:54:47Z

+logger = logging.get_logger(__name__)
+
+
+class Cosmos3DecodeStep(ModularPipelineBlocks):


I think we should split by modality as well, so Cosmos3VideoDecoderStep and Cosmos3SoundDecoderStep (the sound one can go into an auto block so it only runs if sound_latents is not None, like https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/z_image/modular_blocks_z_image.py#L231).

Similar to the encoder steps, the user should also be able to run a decoder step standalone, so each block should just decode latents and run the safety checker, nothing else.

The action-related code in the current block isn't decoding. I think it can probably go into its own block.
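
A small sketch of the modality split, again in plain Python with hypothetical names (`decode_video`, `decode_sound`, `maybe_decode_sound`, and the arithmetic "decoders" are placeholders, not the Cosmos3 or diffusers code). It shows the sound path being skipped when `sound_latents` is absent, and each decoder being callable on its own.

```python
def decode_video(state: dict) -> dict:
    # Stand-in video decoder: only reads `latents`, only writes `video`.
    state["video"] = [v * 2 for v in state["latents"]]
    return state


def decode_sound(state: dict) -> dict:
    # Stand-in sound decoder: only reads `sound_latents`, only writes `sound`.
    state["sound"] = [s + 1 for s in state["sound_latents"]]
    return state


def maybe_decode_sound(state: dict) -> dict:
    # Auto-style wrapper: skip the sound decoder when no sound latents were produced.
    if state.get("sound_latents") is None:
        return state
    return decode_sound(state)


def decode_all(state: dict) -> dict:
    # End-to-end decode: video always runs, sound runs only when triggered.
    for step in (decode_video, maybe_decode_sound):
        state = step(state)
    return state


silent = decode_all({"latents": [1, 2]})
with_sound = decode_all({"latents": [1, 2], "sound_latents": [10]})

# Standalone use: decode stored latents later without touching the sound path.
video_only = decode_video({"latents": [3]})
```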

Cosmos3 ModularPipeline initial commit

ab03090

github-actions bot added the documentation (Improvements or additions to documentation), tests, modular-pipelines, and size/L (PR with diff > 200 LOC) labels on Jul 2, 2026

Fix from_pretrained for modular pipleine without modular_model_index.…

54bc611

…json

yzhautouskay wants to merge 2 commits into huggingface:main from yzhautouskay:yzhautouskay/cosmos3_modular_pipeline
